Diverse Topic Phrase Extraction from Text Collection
Abstract
Keyword extraction is an efficient approach to managing the explosion of online text on the Web. Traditionally, an abstraction of the online text is constructed through keywords, which are extracted according to a certain importance measure, such as their occurrence frequency. However, previous work has not considered another important factor: the diversity of the keywords. As a result, the extracted keywords tend to cluster around one popular topic in the corpus while failing to cover other important topics. In this paper, we propose new algorithms to alleviate these disadvantages of traditional keyword extraction methods. First, we propose to extract key phrases instead of keywords, because phrases can effectively reduce the ambiguity of single words. Second, by leveraging latent semantic analysis, we learn the related topics for each phrase as well as the distances among the phrases, so that the extracted phrases cover more topics. To demonstrate the performance of our method, we conducted experiments on two open datasets: 20 Newsgroups and Reuters-21578. We design three novel evaluation metrics, based on which both qualitative and quantitative analyses show that our proposed algorithm significantly improves key phrase extraction performance.
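The abstract does not spell out the proposed algorithms, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a small phrase-by-document count matrix, uses a truncated SVD as the latent semantic analysis step, and then picks phrases greedily by trading frequency against similarity to phrases already chosen. The function names, the `trade_off` parameter, and the greedy scoring rule are all assumptions made for illustration.

```python
import numpy as np


def lsa_phrase_vectors(phrase_doc_counts, n_topics):
    """Project phrases into a latent topic space with a truncated SVD (LSA)."""
    u, s, _ = np.linalg.svd(phrase_doc_counts, full_matrices=False)
    vectors = u[:, :n_topics] * s[:n_topics]
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Unit-normalize rows so that dot products are cosine similarities.
    return vectors / np.where(norms == 0, 1.0, norms)


def select_diverse_phrases(phrase_doc_counts, k, n_topics=2, trade_off=0.5):
    """Greedily pick k phrase indices that are frequent yet topically spread.

    Hypothetical scoring rule (an assumption, not taken from the paper):
    score = trade_off * frequency - (1 - trade_off) * max similarity
    to any phrase that has already been selected.
    """
    counts = np.asarray(phrase_doc_counts, dtype=float)
    if k <= 0 or counts.size == 0:
        return []
    vectors = lsa_phrase_vectors(counts, n_topics)
    importance = counts.sum(axis=1)
    top = importance.max()
    if top > 0:
        importance = importance / top  # scale frequencies into [0, 1]
    similarity = vectors @ vectors.T

    # Start from the most frequent phrase, then add one phrase at a time.
    selected = [int(np.argmax(importance))]
    while len(selected) < min(k, len(counts)):
        redundancy = similarity[:, selected].max(axis=1)
        score = trade_off * importance - (1 - trade_off) * redundancy
        score[selected] = -np.inf  # never pick the same phrase twice
        selected.append(int(np.argmax(score)))
    return selected


# Toy data: phrases 0 and 1 occur only in documents 0-1 (one topic),
# phrases 2 and 3 occur only in documents 2-3 (another topic).
toy_counts = [
    [5, 4, 0, 0],
    [4, 4, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 2],
]
diverse = select_diverse_phrases(toy_counts, k=2)
frequency_only = select_diverse_phrases(toy_counts, k=2, trade_off=1.0)
```

On this toy matrix, ranking by frequency alone (`trade_off=1.0`) returns the two phrases from the same topic, while the default setting returns one phrase from each topic. This is the crowding effect the abstract describes, reproduced at miniature scale.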
Similar resources
Acquiring Topic Features to improve Event Extraction: in Pre-selected and Balanced Collections
Event extraction is a particularly challenging type of information extraction (IE) that may require inferences from the whole article. However, most current event extraction systems rely on local information at the phrase or sentence level, and do not consider the article as a whole, thus limiting extraction performance. Moreover, most annotated corpora are artificially enriched to include enou...
Accurate Keyphrase Extraction from Scientific Papers by Mining Linguistic Information
In this paper we investigate the impact of candidate terms filtering using linguistic information on the accuracy of automatic keyphrase extraction from scientific papers. According to linguistic knowledge, the noun phrases are most likely to be keyphrases. However the definition of a noun phrase can vary from a system to another. We have identified five POS tag sequence definitions of a noun p...
A survey on phrase structure learning methods for text classification
Text classification is a task of automatic classification of text into one of the predefined categories. The problem of text classification has been widely studied in different communities like natural language processing, data mining and information retrieval. Text classification is an important constituent in many information management tasks like topic identification, spam filtering, email r...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
Constructing Knowledge Maps of A Manager's Managerial Logic by A Text Mining Approach
The objective of this research is to represent the managerial logic of Mr. Yung-Ching Wang, the Chairman of Formosa Plastics Group (also known as the “God of Business” in Taiwan) through the construction of knowledge maps using a text-mining approach, including automatic key phrase extraction, term identification, document vector modeling, and a clustering method named growing hierarchical self...
Publication date: 2005